Analyzing data using AI Platform Notebooks and BigQuery
Overview
In this lab, you analyze a large (70 million rows, 8 GB) airline dataset using BigQuery and AI Platform Notebooks.
What you learn
In this lab, you:
- Launch AI Platform Notebooks.
- Invoke a BigQuery query.
- Create graphs in AI Platform Notebooks.
This lab illustrates how you can explore large datasets but continue to use familiar tools like Pandas and Juypter. The "trick" is to do the first part of your aggregation in BigQuery, have a Pandas dataset returned, and then work with the smaller Pandas dataset locally. AI Platform Notebooks provides a managed Jupyter experience, so you don't need to run notebook servers yourself. For more information about how to visualize BigQuery data in a Jupyter notebook, see Visualizing BigQuery data in a Jupyter notebook.
Setup
For each lab, you get a new Google Cloud project and set of resources for a fixed time at no cost.
-
Make sure you signed into Qwiklabs using an incognito window.
-
Note the lab's access time (for example,
and make sure you can finish in that time block.
-
When ready, click
.
-
Note your lab credentials. You will use them to sign in to the Google Cloud Console.
-
Click Open Google Console.
-
Click Use another account and copy/paste credentials for this lab into the prompts.
- Accept the terms and skip the recovery resource page.
Deployment Manager
This lab uses a Cloud Deployment Manager script to create the Cloud AI Platform instance you will need for this exercise. The instance should be ready in 2 or 3 minutes.
Please wait before launching the Jupyter notebook; otherwise, the script might be interrupted and the repository might not be cloned.
Invoke BigQuery
Open BigQuery Console
In the Google Cloud Console, select Navigation menu > BigQuery:
The Welcome to BigQuery in the Cloud Console message box opens. This message box provides a link to the quickstart guide and lists UI updates.
Click Done.
The BigQuery console opens.
Step 1
In the Query editor, type:
#standardSQL
SELECT
departure_delay,
COUNT(1) AS num_flights,
APPROX_QUANTILES(arrival_delay, 5) AS arrival_delay_quantiles
FROM
`bigquery-samples.airline_ontime_data.flights`
GROUP BY
departure_delay
HAVING
num_flights > 100
ORDER BY
departure_delay ASC
Click Run.
What is the median early arrival for flights that left 35 minutes early?
(Answer: the typical flight that left 35 minutes early arrived 28 minutes early.)
Step 2 (Optional)
Can you write a query to find the airport pair (departure and arrival airport) that had the maximum number of flights between them?
One possible solution:
#standardSQL
SELECT
departure_airport,
arrival_airport,
COUNT(1) AS num_flights
FROM
`bigquery-samples.airline_ontime_data.flights`
GROUP BY
departure_airport,
arrival_airport
ORDER BY
num_flights DESC
LIMIT
10
Draw graphs in AI Platform Notebooks
Step 1
In the Google Cloud Console, on the Navigation Menu, click AI Platform > Notebooks.
Step 2
For the python-notebook instance, click Open JupyterLab.
Step 3
In JupyterLab, to start a new notebook, click Notebook > Python 3.
Step 4
In the first cell of the notebook, to install BigQuery version 1.25.0, type the following, and then click Run.
!pip install google-cloud-bigquery==1.25.0
Step 5
In the next cell in the notebook, type the following, and then click Run.
query="""
SELECT
departure_delay,
COUNT(1) AS num_flights,
APPROX_QUANTILES(arrival_delay, 10) AS arrival_delay_deciles
FROM
`bigquery-samples.airline_ontime_data.flights`
GROUP BY
departure_delay
HAVING
num_flights > 100
ORDER BY
departure_delay ASC
"""
from google.cloud import bigquery
df = bigquery.Client().query(query).to_dataframe()
df.head()
Note that the results from BigQuery are returned as a Pandas dataframe.
What Python data structure are the deciles in?
Step 6
In the next cell in the notebook, type the following, and then click Run.
import pandas as pd
percentiles = df['arrival_delay_deciles'].apply(pd.Series)
percentiles = percentiles.rename(columns = lambda x : str(x*10) + "%")
df = pd.concat([df['departure_delay'], percentiles], axis=1)
df.head()
What did this code do to the columns in the Pandas dataframe?
Step 7
In the next cell in the notebook, type the following, and then click Run.
without_extremes = df.drop(['0%', '100%'], 1)
without_extremes.plot(x='departure_delay', xlim=(-30,50), ylim=(-50,50));
If you were creating a machine learning model to predict the arrival delay of a flight, would a departure delay be a good input feature? Is this true at all ranges of departure delays?
Summary
In this lab, you learned how to carry out data exploration of large datasets using BigQuery, Pandas, and Jupyter. For more information about working with the Pandas library, see Pandas: How to Read and Write Files.
End your lab
When you have completed your lab, click End Lab. Qwiklabs removes the resources you’ve used and cleans the account for you.
You will be given an opportunity to rate the lab experience. Select the applicable number of stars, type a comment, and then click Submit.
The number of stars indicates the following:
- 1 star = Very dissatisfied
- 2 stars = Dissatisfied
- 3 stars = Neutral
- 4 stars = Satisfied
- 5 stars = Very satisfied
You can close the dialog box if you don't want to provide feedback.
For feedback, suggestions, or corrections, please use the Support tab.
©2020 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.